1 Pre-session reading

Time - 5 minutes

1.1 Recap

To briefly recap, last week we covered:

  • How to fit a GLM/M
  • How to decide on nested/crossed random effects
  • How to account for temporal autocorrelations
  • How to visualise residuals (difference between predicted and observed)
  • How to tell if a model is fitting our data well
  • How to explore alternative models
  • How to compare models
  • How to interpret model summaries
  • How to plot the results of GLMMs

1.2 Faster code

One criticism of R is that it is (relatively speaking - try doing the sums by hand instead!) slow. R is a high-level programming language (i.e. it has a bunch of predefined functions, automatic data handling, etc.), which makes it a great language to use and learn, but can (sometimes) make it a little slow when compared to lower-level languages (like C++). The creators of R have gone a long way towards solving these issues, and indeed a lot of the R functions you use will actually call C code in the background (and you can write C code within R if you really wanted to, for example if you were writing a new package for fast data analysis).

However, whilst a lot of functions have been optimised to run fast, in reality the most likely reason your code is slow (and might take ages to run) is not-best-practice coding on your part. So this workshop will be on writing fast R code, and - where necessary - parallelising that code.

Before you start to optimise your code consider: do you actually need to?

This might sound silly, but if it takes you 3 hours to optimise your code and that saves you 3 minutes of run-time then there isn't much point in doing it! Consider optimising code when you have to run the same piece of code over and over again, and thus where small savings in the speed of that code can make a big difference to the total run time.

Alternatively, just get a bigger/faster computer to run your code on!

That being said, here are some nifty methods for speeding up your R code.

2 Today’s practical

We are going to look at some easy ways to speed up your code. Hopefully this will be a nice break from all the statistics!

2.1 Methods for timing code

Time - 10 minutes

First off, we need to find out how fast our code is. R has a few different ways of timing code. There is a generic base function, system.time(), which allows you to time expressions:

##long vector
x <- c(1:1e6)

##time a calculation
system.time(sin(x + x^2))
##    user  system elapsed 
##   0.019   0.001   0.020

The “elapsed” value gives us the time in seconds it took to run the code.

Note - we didn't have the values of our calculation returned to us! There is a way around that…
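One simple option (before we get to the { } approach below) is to assign the result inside the timed expression - system.time() evaluates the expression in your workspace, so the object sticks around afterwards. A minimal sketch:

```r
##long vector
x <- c(1:1e6)

##assign within the timed expression, so the result
##is kept in the workspace after timing
timing <- system.time(v <- sin(x + x^2))

##v now exists as normal
length(v)
## [1] 1000000
```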

If we want to include more complex code we can wrap the code in { } and put it into system.time():

##time a calculation
system.time({
  ##long vector
  x <- c(1:1e6)
  
  ##calculate the sin of the following
  v <- sin(x + x^2)
  
  ##put it into a tibble
  data_out <- as_tibble(v)
  })
##    user  system elapsed 
##   0.026   0.003   0.030
##look at the data
data_out
## # A tibble: 1,000,000 × 1
##      value
##      <dbl>
##  1  0.909 
##  2 -0.279 
##  3 -0.537 
##  4  0.913 
##  5 -0.988 
##  6 -0.917 
##  7 -0.522 
##  8  0.254 
##  9  0.894 
## 10 -0.0442
## # … with 999,990 more rows

Now you can see we have saved the various components as objects, and can access them as normal.

An alternative (and more powerful) approach is via the microbenchmark library. This allows you to time one expression, or compare two or more expressions against one another. It does this multiple times (as there are always slight differences in how long even the same chunk of code takes to run each time). As above you can include more complex code in { }, but it is often easier to save the code as a function and run that:

##load the microbenchmark package
library(microbenchmark)

##long vector
x <- c(1:1e6)
  
##approach 1
f_1 <- function(x){
  ##calculate the sin of the following
  v <- sin(x + x^2)
  
  ##put it into a tibble
  data_out <- as_tibble(v)
  
  return(data_out)
}

##approach 2
f_2 <- function(x){
  ##calculate the sin of the following
  ##without saving objects v and data_out
  return(as_tibble(sin(x + x^2)))
}

##compare the speeds of the two functions, 
##running each 1000 times
##note - we call the functions (f_1(x)) rather than 
##just naming them (f_1), otherwise we would only time 
##how long it takes R to find the function object!
benchmarks <- microbenchmark(f_1(x),
                             f_2(x),
                             times = 1000, 
                             unit = "s")

##look at the results
##(exact times will vary between runs and machines)
benchmarks

Typically you should find that f_2 is slightly faster than f_1 (lower mean and median), so if we were to repeat this calculation hundreds of times it would be sensible to use the f_2 function.

2.2 Concise code

Time - 15 minutes

The above example neatly highlights an important way we can simply and easily speed up our R code - by minimising the number of steps R takes to produce our results. This should be obvious, but is worth pointing out. In our first function:

##approach 1
f_1 <- function(x){
  ##calculate the sin of the following
  v <- sin(x + x^2)
  
  ##put it into a tibble
  data_out <- as_tibble(v)
  
  return(data_out)
}

We take the vector x, calculate its sin() and save that as v, and then use v to make a tibble. This code works fine, but is slower than the second function:

##approach 2
f_2 <- function(x){
  ##calculate the sin of the following
  ##without saving objects v and data_out
  return(as_tibble(sin(x + x^2)))
}

In this second function we are taking far fewer steps and creating fewer objects (so it will be more memory and processor efficient), however this comes at a cost - readability. For a computer, both code chunks are equally easy to read, but for a human the first (with its annotation and separation of calculations) is far easier than the second. This is worth considering when you are writing your scripts - is being more readable of greater importance than being slightly faster?


Task

Write the following chunks of code in a more concise way:

############################################
## example 1
############################################
## vector of values
x <- c(1, 2, 3, 4, 5, 6, 7)

sample(x, 2)

############################################
## example 2
############################################

data("ToothGrowth")

sub1 <- filter(ToothGrowth, len>10) 

sub2 <- filter(sub1, supp=="VC")

ggplot(sub2, aes(x=factor(dose), y=len))+geom_boxplot()+theme_minimal()
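If you get stuck, here is one possible more concise version of each (a sketch, assuming the tidyverse is loaded as in previous sessions):

```r
##load the tidyverse (for filter(), %>% and ggplot())
library(tidyverse)

############################################
## example 1 - sample directly from the sequence
############################################
sample(1:7, 2)

############################################
## example 2 - chain the steps with the pipe
############################################
data("ToothGrowth")

ToothGrowth %>%
  filter(len > 10, supp == "VC") %>%
  ggplot(aes(x = factor(dose), y = len)) +
  geom_boxplot() +
  theme_minimal()
```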

2.3 Vectorising

Time - 30 minutes

The good news is that a lot of R functions are already vectorised for you. "But wait", I hear you say, "what is vectorising?"

Well, the easiest way to show this is with an example. Say we want to add the following two vectors together:

3 4 5 2 

1 2 3 4

If we were to do this in a non-vectorised way, say by hand, then we would do something like the following:

3 + 1 = 4
4 + 2 = 6
5 + 3 = 8
2 + 4 = 6

To give our new vector of:

4 6 8 6

This isn’t what R does (thankfully) as this would be, as you can imagine, very slow. Instead R can do the following:

vec_1 <- c(3, 4, 5, 2) 

vec_2 <- c(1, 2, 3, 4) 

vec_1 + vec_2
## [1] 4 6 8 6

Looks the same, but what we have actually done here is this:

3 4 5 2 
+ + + +
1 2 3 4

i.e. we have added the first vector to the second vector as single objects, rather than doing the sums independently for each of the numbers within the vectors. This might not sound like a big deal with the very small vectors we are adding here (4 numbers in each), but the difference between the two approaches becomes very obvious when we start adding together much bigger vectors.

As we said, many R functions are already vectorised, including:

  • arithmetic operators (+ - * /)
  • logical and comparison operators (e.g. & > ==)
  • matrix operations (e.g. multiplying/dividing one matrix by another)
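For example, the comparison and logical operators below are each applied to every element of a vector in a single vectorised step:

```r
##a short vector
x <- c(2, 7, 1, 9, 4)

##one comparison applied to every element at once
x > 5
## [1] FALSE  TRUE FALSE  TRUE FALSE

##two vectorised conditions combined element-wise
x > 1 & x < 8
## [1]  TRUE  TRUE FALSE FALSE  TRUE
```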

2.3.1 Vectorised vs non-vectorised - an example using loops

Let's demonstrate this by writing a function to do pair-wise addition, and comparing it to a vectorised approach. We are going to use the for() function to write a loop to make this happen. Loops can be a very useful way of iterating through data. To make a loop in R we use the for() function, which takes two arguments: the first makes a new object, which here we have called i (the name doesn't matter - it could be "custard"); the second should be a vector of the length we want to iterate along. for() will then repeat the code inside the {} brackets until the specified number of iterations has been completed, and at each iteration i will become the next value in the vector. The easiest way to understand this is to see it in action:

## a vector of values
x <- 1:5

##loop along the length of x
for(i in 1:length(x)){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

You can see that above we have made a vector, and then the for() loop has run 5 times (specified by the 1:length(x)), and at each iteration of the loop it has used a new value for i taken from the vector 1:length(x). We can then use these loops for a multitude of different purposes.
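One small aside worth knowing: 1:length(x) goes wrong if x happens to be empty (it produces c(1, 0), so the loop still runs twice), whereas seq_along(x) handles this case safely:

```r
##an empty vector
y <- numeric(0)

##counts down from 1 to 0 - two loop iterations
##over a vector with nothing in it!
1:length(y)
## [1] 1 0

##an empty index - no iterations, as intended
seq_along(y)
## integer(0)
```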

Going back to our original question (vectorised vs non-vectorised code), we will use the for() loop to iterate along the two vectors, selecting the i-th value in each vector and adding them together:

##non-vectorised addition function
non_vec_addition <- function(x, y){
  ##define an object to store the values in
  z <- numeric(length(x))
  
  ##use a loop to add each element of the vectors together in turn
  ##and then save the output into the correct slot in the z vector
  for(i in 1:length(x)){
    z[i] <- x[i] + y[i]
  }
  ##return the result
  return(z)
}

We can then use this function to do non-vectorised addition of two vectors:

##addition of two vectors
non_vec_addition(vec_1, vec_2)
## [1] 4 6 8 6

To demonstrate how powerful vectorisation is, let's apply our new function to some bigger vectors, time how long it takes to complete, and compare that to the vectorised version:

##two long vectors
long_vec_1 <- runif(10000000, 0, 1)
long_vec_2 <- runif(10000000, 0, 1)

##time the non-vectorised vs vectorised approaches
system.time({
  non_vec_addition(long_vec_1, long_vec_2)
  })
##    user  system elapsed 
##   0.879   0.017   0.896
system.time({
  long_vec_1+long_vec_2
  })
##    user  system elapsed 
##   0.005   0.000   0.005

system.time() is a really useful function for finding out how fast bits of your code are, and testing if you can speed them up. Here you can see that the non-vectorised version takes 0.896 seconds, whilst the vectorised version takes a blistering 0.005 seconds. Clearly these are both small numbers, but the difference all adds up when you are doing large calculations, especially if you have to do those calculations hundreds of times!
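If you want more reliable timings than a single system.time() call, you could also compare the two approaches with microbenchmark; a sketch (the loop-based function is repeated here so the chunk is self-contained):

```r
##load the microbenchmark package
library(microbenchmark)

##the loop-based addition function from above
non_vec_addition <- function(x, y){
  z <- numeric(length(x))
  for(i in seq_along(x)){
    z[i] <- x[i] + y[i]
  }
  return(z)
}

##two shorter vectors so the benchmark runs quickly
a <- runif(10000)
b <- runif(10000)

##compare the loop to the vectorised +, 100 runs each
microbenchmark(loop       = non_vec_addition(a, b),
               vectorised = a + b,
               times = 100)
```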

2.3.2 Practically, what does this mean?

Well, a fair question to ask. In practice, many of the functions you use will already be vectorised, so you don't have to think about it. However, sometimes we might (for convenience, or because we haven't thought of a better way) use something like the following to sort through data:

##make sure you load in the data.table library to access rbindlist()

##make a blank list 
save_data <- vector(mode = "list", 
                    length = length(unique(iris$Species)))

##add some additional data to to the iris data frame
for(o in 1:length(unique(iris$Species))){
  ##subset the full data set, making a new subset for each species in turn
  sub <- subset(iris, Species==unique(iris$Species)[o])
  
  ##add in a new column called "outlier"
  ##we will use this to identify possible outliers in the data
  ## we initially fill this with NAs
  sub$outlier <- NA
  ##then if the subset data is from the species "versicolor"...
  if(sub$Species[1]=="versicolor"){
    ##...and the value of the Petal.Length column is greater than 5
    ## then save an "outlier" note in the outlier column
    sub$outlier[which(sub$Petal.Length > 5)] <- "outlier"
  }
  ##add the data to our blank list
  save_data[[o]] <- sub
}

##bind the list together (using rbindlist from the data.table package)
save_data <- rbindlist(save_data)

##see if it worked:
filter(save_data, Species == "versicolor", Petal.Length > 5)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species outlier
## 1:            6         2.7          5.1         1.6 versicolor outlier

Ok, so clearly this code does WORK. But in so many ways this isn't "good" code - it's slow, it's hard to read, and because it's so complex there is a higher chance of you making a coding error.

There is a better way to do this!


Task

Use a vectorised approach to speed up this process, and time how long it takes when compared to the above (hint - logical operators are vectorised!).


Ok, so let's have a think about what we were trying to do with the original code. In essence we are making a logical comparison, so we can do the above with much simpler code, and make it much faster:

## make a new column
iris$outlier <- NA

##using logical operators
iris$outlier[iris$Petal.Length>5 & iris$Species=="versicolor"] <- "outlier"

So in terms of practical applications to your code - try and use vectorised functions wherever possible, and try and simplify your code as much as possible (both for ease of reading and for speed).
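As a variation on the same idea, the tidyverse's vectorised if_else() achieves the same thing inside mutate() (a sketch, assuming the tidyverse is loaded):

```r
##load the tidyverse for mutate() and if_else()
library(tidyverse)

##flag the same outliers with one vectorised call
iris2 <- iris %>%
  mutate(outlier = if_else(Petal.Length > 5 & Species == "versicolor",
                           "outlier", NA_character_))

##check it flags the same single row as before
filter(iris2, !is.na(outlier))
```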

2.4 Defining objects inside functions/loops/etc

When you define an object using the <- operator you are allocating memory to a specific task:

x <- runif(10, 0, 1)

What do you think is wrong with doing the following:

for(h in 1:length(x)){
  ##calculate the sum of the standard deviation of x
  sd_x <- sum(sd(x))
  
  ##define a constant
  cons <- 3.1512
  
  ##add to the h'th item in x:
  x[h] <- x[h] + sd_x + cons
}

Task

Write a more efficient version of the above.


Again, our above code works fine, but every time the loop runs we are defining sd_x and cons again, even though both of these can be defined outside of the loop. This takes extra time and is memory inefficient, so define them before the loop (or function, etc.):

##calculate the sum of the standard deviation of x
sd_x <- sum(sd(x))
  
##define a constant
cons <- 3.1512
  
for(h in 1:length(x)){
  ##add to the h'th item in x:
  x[h] <- x[h] + sd_x + cons
}


2.5 Defining functions inside functions

Time - 10 minutes

Similarly to above, avoid defining functions inside other functions. I.e. don’t do this:

##define a function first 
my_func <- function(x, y){
  
  ##define a second function
  my_second_func <- function(x){
    return(x^3)
  }
  
  ##run the new function
  new_x <- my_second_func(x)
  
  ##return the new values
  return(list(new_x, y))
}

Do this:

##define a second function
my_second_func <- function(x){
  ##return a new value for x
  return(x^3)
}

##define a function first 
my_func <- function(x, y){
  
  ##run the new function
  new_x <- my_second_func(x)
  
  ##return the new values
  return(list(new_x, y))
}

Note - you will need to define my_second_func first, as it is used inside my_func, and R reads scripts from top to bottom - like you do!

2.6 Pre-allocating memory

Time - 5 minutes

Unlike some other programming languages, you don't need to preallocate memory in R; however, doing so can significantly speed up your code. For example:

##a blank object
results <- NULL

##a sequence to loop along
loop_nums <- 1:2000000

##time the loop
system.time({
  ##run a loop
  for(j in loop_nums){
    ##do a calculation
    calc <- sin(j + (j^2))
    
    ##bind out the results
    results[[j]] <-calc
  }
})
##    user  system elapsed 
##   1.600   0.066   1.667

Now the pre-allocated version:

##a sequence to loop along
loop_nums <- 1:2000000

##a blank list of length loop nums
better_results <- vector(mode = "list", 
                         length = length(loop_nums))

##time the loop
system.time({
  for(j in loop_nums){
    ##do a calculation
    calc <- sin(j + (j^2))
    
    ##bind out the results
    better_results[[j]] <- calc
  }
})
##    user  system elapsed 
##   0.304   0.018   0.323

That's roughly 5 times faster! Magic.
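The same idea applies when each iteration returns a single number - preallocate a numeric vector with numeric() rather than growing an object as you go. A minimal sketch:

```r
n <- 1000

##preallocate a numeric vector of the final length
res <- numeric(n)

for(j in 1:n){
  ##fill the j'th slot in place - no re-allocation needed
  res[j] <- sin(j + (j^2))
}

##the vector is full
length(res)
## [1] 1000
```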

2.7 Parallelisation

Time - 70 minutes

When the building blocks of R were developed, computers were typically single-cored; however, most mobile phones now have 6-8 cores, allowing operations to be carried out in parallel to speed up the processing of complex operations.

When we have run R over the last 5 weeks we have been using a single core (processor); however, there are a number of ways to parallelise your code to allow it to run significantly faster (approximately time/number_of_cores). I find this particularly useful for running simulations, as I will demonstrate first, but a number of packages also employ parallelised versions of single-core functions to speed up things like complex model selection (harking back to our GLM/M workbook from last week). We will go into that below too.

2.7.1 lapply() and mclapply()/parLapply()

First off, let's build a simple two species Lotka-Volterra (LV) model in R. Hopefully you all already know the LV model. This is an example of a system of differential equations. We are going to use a two species version:

\[\displaystyle\frac{dR}{dt} = rR - \alpha N R\] \[\displaystyle\frac{dN}{dt} = f \alpha NR - qN\] Where R is the number of the basal prey species and N is the number of primary consumers. The basal species R grows at rate r, alpha is the search efficiency/attack rate of the primary consumers, f is the conversion efficiency of R into new N, and q is the exponential decline rate of N in the absence of prey (R).

Now let's code this in R:

## load the deSolve package for solving
## differential equations
library(deSolve)

## define a two species LV model
lv2 <- function(t, start, parms) {
    ##set the start values for each of the species
    ##basal species
    R <- start["R"]
    ## consumers
    N <- start["N"]
    
    ##allow R to look within parms for the values of r, a, e, etc
    with(as.list(parms), {
        ## dynamics of the resources
        dR <- r*R - a*N*R
        ## dynamics of the primary consumers
        dN <- f*a*N*R - q*N 
        ##return a list of the abundances of each species
        ##deSolve will then group these into a data frame
        list(c(dR, dN))
    })
}

##we will pass these two functions to the ODE solver 
##to tell it to set the abundance of either species to 0
##if the simulated abundance falls below 1 individual
##i.e. we will simulate extinction of the species; otherwise
##we will have some abundance values which are very very small
##and very very large
eventFun<-function(t,y,p){
  y <- c(0, 0)
  return (y)
}
rootfun <- function (t,y,p) {
  if(min(y)<1){
    y <- c(0, 0)
  }
  return(y)
}

##we will then put the above function into a wrapper function which includes
##the ODE solver function and some data reshaping and saving out parameter combinations:
lv2_sim <- function(start, parms, time){
  ##run the simulation
  sim <- as_tibble(lsoda(y = start, 
                          times = time, 
                          func = lv2, 
                          parms = parms,
                          ##pass the event function
                          events = list(func = eventFun, 
                                       ##yes we want to use a root 
                                       ##function
                                       root = TRUE, 
                                       ##tell lsoda to stop the
                                       ##simulation if the event
                                       ##is triggered
                                       terminalroot = 1),
                          ##and the root function
                          rootfun = rootfun))
  
  ##reshape the data to long format:
  longer_sim <- sim %>% pivot_longer(-time, 
                                      values_to = "abundance",
                                      names_to="species")
  ##number of years the time series ran for:
  longer_sim$sim_length<-max(longer_sim$time)
  
  ##add a column to allow us to easily split the results up later
  longer_sim$parms <- paste(names(parms), parms, sep="=", collapse = ",")
  
  ##make a list of the simulation results and the 
  ##parameters which made them
  res <- list("sim" = longer_sim,
              "parameters" = parms)
  return(res)
}

##make an object of the parameters of the model
##we will pass this to the ODE solver
parms <- c(r = 0.1,     #growth rate of resources
           a = 0.01,        #feeding rate of primary consumer on resources
           f = 0.01,        #efficiency of feeding on resources
           q = 0.1      #intrinsic death rate of the primary consumer
           )

#define the starting abundances
start <- c(R = 100, 
           N = 10)

##set the length of the simulation (100 time steps)
## and the resolution (step size, in this case 1 time step)
time <- seq(0, 100, 1)

##run the simulation
dd <- lv2_sim(start, parms, time)

##plot it
ggplot(dd$sim, aes(x=time, y=abundance, col=species)) + 
  geom_line() +
  scale_y_log10() + 
  theme_minimal()

Great! So we have a two species LV model up and running, but our parameter values are… uninspiring, to say the least. One thing we might be really interested in is how our model behaves over a range of parameter values, rather than just the single set given above, i.e. we want to set up a set of simulations covering a range of possible parameter values. Let's go about that:

##make a list where the sequence 0.01 to 0.2 is
##replicated once for each of the parameters in parms
parms_list <- rep(list(seq(0.01, 0.2, length.out = 5)), 
                  length(parms))

##use expand.grid() to make a data.frame of every possible 
##combination of the 4 parameters
all_var <- expand.grid(parms_list)

Task

Look at the top 6 rows of the data, and calculate the dimensions of the data.frame


##look at the data
head(all_var)

## the dimensions of the data
dim(all_var)

Then we will turn this into a list where each object in the list is 1 row of the data frame:

##convert that into a list, where each object is 1 row of the data frame
all_var_list <- as.list(as.data.frame(t(all_var)))

Ok great, we have created a list with all of the possible combinations of our 4 parameters saved as individual objects - a total of 625 simulations. Now we want to pass each of these combinations of parameters to our model in turn. Because we have a list, we can use the lapply() function, which applies a function to a list (durr!). This is a really convenient way of running simulations:

##save out the time it takes to run
time_single_core <- system.time({
      ##the simulations
      res <- lapply(all_var_list, function(x, name_vars, start, time){
        try({
              ##add names to the parameters selected from the list
              names(x)<-name_vars
              ##pass the arguments to the function and run the simulation
              dd <- lv2_sim(start = start, 
                            parms = x, 
                            time = time)
              ##return the data we want:
              return(dd)
        }, silent = TRUE)
            ##any additional objects we want to pass to lapply
            ## listed between the last }  and the last )
            },  name_vars = names(parms),
                start = start, 
                time = time)
  })
##how long did it take?
time_single_core
##    user  system elapsed 
##   2.397   0.053   2.451

Ok, so these simulations took around 2.5 seconds on my computer - remember this is for 5 values of 4 parameters (i.e. a very low number of simulations), and this model is deterministic so we don't need to run multiple realisations! The problem with doing simulations in this way is that, because we are doing a fully factorial set of simulations (i.e. every possible combination), adding in any new parameters/parameter values quickly explodes the number of simulations:
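To see how quickly a fully factorial design grows, note that the number of simulations is (values per parameter)^(number of parameters). A quick sketch of this growth for our 4 parameters:

```r
##number of values tried per parameter
values_per_param <- c(5, 10, 20, 50)

##with 4 parameters, a fully factorial design needs this
##many simulations:
values_per_param^4
## [1]     625   10000  160000 6250000
```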

The above clearly demonstrates that parallel computing becomes much more important as your simulations get more complex, so let's see if we can speed up our simulations.

The parallel library has some nifty functions to help us here, so first off install the package and load it.

It is then useful to know how many cores our computer has:

##load the parallel package
library(parallel)

##show the number of cores
n_cores <- detectCores()
n_cores
## [1] 8

We can then use a parallelised version of the lapply() function to run the simulations again.

Unfortunately this is where Windows and Mac differ, so it's a choose-your-own-adventure depending on your operating system…

2.7.1.1 MacOS

For macOS:

##save the time out
time_multi_core <- system.time({
      ##run the simulations
      res <- mclapply(all_var_list, function(x, name_vars, start, time){
        try({
              ##add names to the parameters selected from the list
              names(x) <- name_vars
              ##pass the arguments to the function and run the simulation
              dd <- lv2_sim(start = start, 
                            parms = x, 
                            time = time)
              ##return the data we want:
              return(dd)
        })
            ##any additional objects we want to pass to lapply
            ## listed between the last }  and the last )
            },  name_vars = names(parms),
                start = start, 
                time = time, 
                ## here we can specify the number of cores we want to use
                ## I always use n.cores-1 so the computer doesn't slow 
                ## down too much
                mc.cores = n_cores - 1)
  })

2.7.1.2 Windows

And for Windows you need to do a little extra work. First off, set up a cluster (an object defining your number of cores and initiating them):

##make the cluster
cl <- makeCluster(n_cores - 1, type = "PSOCK")

##then export any functions we have defined to the cluster so we can use them in the simulations
clusterExport(cl, c("lv2_sim", 
                    "eventFun", 
                    "rootfun", 
                    "lv2"), 
              envir=environment())

And then we can continue as before, but instead using parLapply():

##save the time out
time_multi_core <- system.time({
  ##run the simulations
  res <- parLapply(all_var_list, function(x, name_vars, start, time){
    ##NOTE need to load in all the packages to each node on the cluster, so:
    library(deSolve)
    library(tidyverse)
    try({
      ##add names to the parameters selected from the list
      names(x) <- name_vars
      ##pass the arguments to the function and run the simulation
      dd <- lv2_sim(start = start, 
                    parms = x, 
                    time = time)
      ##return the data we want:
      return(dd)
    })
    ##any additional objects we want to pass to lapply
    ## listed between the last }  and the last )
  },  name_vars = names(parms),
  start = start, 
  time = time, 
  ## specify which cluster you are running it on, i.e. the one we defined above (cl)
  cl = cl)
})

Great, let's compare the times.


Task

Calculate the ratio of the time taken using a single core to that taken using multiple cores, from the times we calculated above


##divide
time_single_core["elapsed"]/time_multi_core["elapsed"]
##  elapsed 
## 2.853318

So you can see that the multi core version is around 3 times faster than the single core version.

Let's plot out our results to have a look at what our model has been up to:

##look at the first object of the results
res[[1]]
## $sim
## # A tibble: 134 × 5
##    time      species abundance  sim_length parms                      
##    <deSolve> <chr>   <deSolve>       <dbl> <chr>                      
##  1 0         R       100.000000       65.8 r=0.01,a=0.01,f=0.01,q=0.01
##  2 0         N        10.000000       65.8 r=0.01,a=0.01,f=0.01,q=0.01
##  3 1         R        91.394497       65.8 r=0.01,a=0.01,f=0.01,q=0.01
##  4 1         N         9.995633       65.8 r=0.01,a=0.01,f=0.01,q=0.01
##  5 2         R        83.536677       65.8 r=0.01,a=0.01,f=0.01,q=0.01
##  6 2         N         9.983051       65.8 r=0.01,a=0.01,f=0.01,q=0.01
##  7 3         R        76.366937       65.8 r=0.01,a=0.01,f=0.01,q=0.01
##  8 3         N         9.963002       65.8 r=0.01,a=0.01,f=0.01,q=0.01
##  9 4         R        69.828937       65.8 r=0.01,a=0.01,f=0.01,q=0.01
## 10 4         N         9.936186       65.8 r=0.01,a=0.01,f=0.01,q=0.01
## # … with 124 more rows
## 
## $parameters
##    r    a    f    q 
## 0.01 0.01 0.01 0.01
##bind the results into a data frame
extract_res <- rbindlist(lapply(res, function(h){
  ##extract and return the simulation values 
  return(h$sim)
}))

##plot the results
ggplot(extract_res, aes(x = time, 
                        y = abundance,
                        group = parms)) +
  geom_line(col = "white", alpha=0.3) +
  facet_wrap(~species) +
  ##log the y axis
  scale_y_log10() +
  ##dark theme
  theme_dark() 

You can see that for some of the simulations the abundance of one of the two species drops below 1 (which would indicate an extinction in the real world), so let's just plot the simulations which run until the end of the 100 time steps.


Task

Plot the data as above but only plot those simulations which reach 100 time steps (i.e. recreate the plot below). Hint - look at the results to see how you might be able to filter that data.


##plot only those simulations which run till the end of the 
##100 time steps
ggplot(filter(extract_res, sim_length == 100),
       aes(x = time, 
           y = abundance,
           group = parms)) +
  geom_line(col = "white", alpha=0.4) +
  facet_wrap(~species, scales = "free") +
  scale_y_log10() +
  theme_dark()
## Don't know how to automatically pick scale for object of type deSolve/matrix. Defaulting to continuous.


Bonus task - code a three species Lotka-Volterra model

The three species L-V model will look like this:

\[\displaystyle\frac{dR}{dt} = rR - \alpha N R\] \[\displaystyle\frac{dN}{dt} = f \alpha NR - qN - eNP\] \[\displaystyle\frac{dP}{dt} = zeNP -bP\]

Where R is the number of the basal prey species, N is the number of primary consumers, and P is the number of the top predators. The basal species R grows at rate r, alpha is the search efficiency/attack rate of the primary consumers, f is the conversion efficiency of R to new N, and q is the exponential decline rate of the N in the absence of prey (R). The top predator (P) eats the intermediate predator (N) at a rate defined by the search efficiency/attack rate e of P on N. The population of P then grows in the same way N does, as a function of z (the conversion efficiency of N’s to new P’s), the attack rate (e), and the number of N and P, whilst the population declines exponentially at a rate b.


2.7.2 Parallel for() loops

We used lapply() above to, in effect, loop through a bunch of parameter values and run the model. There is also a parallelised version of for() which we can use to achieve a similar thing - foreach(). foreach() is more complicated to implement than lapply()/mclapply(), but has its advantages: it is more flexible both in terms of outputs (lapply()/mclapply() return a list, which you then have to work on, e.g. with data.table::rbindlist()) and in whether the code is run in parallel or not.

The basic form of foreach() is similar to that of the for() loops we used above - we define an object to iterate over, and give each iteration of that object a name (in the example below, i) such that at each iteration i is a different value from our object. We then tell foreach() to execute some code, and can specify whether this is carried out in parallel or not. First off, let's look at a non-parallelised version:

##load the foreach package
library(foreach)

##an object to iterate over
our_seq <- 1:3

##Not parallelised
foreach(i = our_seq) %do% {
  ##some calculation
  sin(i/i^2)
}
## [[1]]
## [1] 0.841471
## 
## [[2]]
## [1] 0.4794255
## 
## [[3]]
## [1] 0.3271947

So you can see that the code is easy to understand. How about a parallelised version?

To do this we need a bit of preamble to tell foreach() how to deal with the parallelisation. Using makeCluster() we make a series of copies of R running in the background, and set the number of copies to detectCores() - 1.

##load the doSNOW package
library(doSNOW)

##make a cluster, defining the number of cores to use
clus  <- makeCluster(detectCores() - 1)

##register the cluster with doSNOW
##you only need to do this once, then you can use it for the rest
##of your script
registerDoSNOW(clus)

##an object to iterate over
our_seq <- 1:30

##then you can run a parallelised version of our foreach() loop
looped_result <- foreach(i = our_seq) %dopar% {
  sin(i/i^2)
}

##look at the output
head(looped_result)
## [[1]]
## [1] 0.841471
## 
## [[2]]
## [1] 0.4794255
## 
## [[3]]
## [1] 0.3271947
## 
## [[4]]
## [1] 0.247404
## 
## [[5]]
## [1] 0.1986693
## 
## [[6]]
## [1] 0.1658961

As we said earlier, one of the advantages of foreach() is its flexibility (see ?foreach). One nice option is the ability to directly output different types of object (rather than just a list, as with mclapply()). For example, below we return a tibble using rbind. We also need to load any packages into the environments created by makeCluster(), as these are created empty (i.e. without packages), so we use the .packages argument to load any required packages into the newly created local environments:

## combine into a tibble
## see ?rbind for details (hint - it means row bind)
new_tib <- foreach(i = our_seq, 
                   .combine = 'rbind',
                   .packages = "tidyverse") %dopar% {
  ##return a tibble object with sin of i
  ##and the value of i
  return(tibble("sin" = sin(i/i^2), 
                   "i" = i))
}

## look at new_tib
new_tib
## # A tibble: 30 × 2
##       sin     i
##     <dbl> <int>
##  1 0.841      1
##  2 0.479      2
##  3 0.327      3
##  4 0.247      4
##  5 0.199      5
##  6 0.166      6
##  7 0.142      7
##  8 0.125      8
##  9 0.111      9
## 10 0.0998    10
## # … with 20 more rows

The above is particularly useful if you are - say - looping through replicates of a simulation and want to save the replicate number (i) as part of the output to discern between replicate simulations.

For particularly large/long simulations (minutes, hours, days) we might want to check how far through the simulation is, and we can do this by adding a progress bar via .options.snow:

##make a progress bar, we need to set the maximum to the 
##maximum length of the sequence we are iterating across
pb <- txtProgressBar(max = max(our_seq), style = 3)
## 
  |                                                                      |   0%
##make a function to pass the progress bar to the clusters
progress <- function(n) setTxtProgressBar(pb, n)
##and set the options for snow to include a progress bar
opts <- list(progress = progress)

##load in the options using .options.snow
new_tib <- foreach(i = our_seq, 
                   .combine = 'rbind',
                   .packages = "tidyverse",
                   .options.snow = opts) %dopar% {
  ##return the output
  return(data.frame("sin" = sin(i/i^2), 
                   "i" = i))
}
## 
  |==                                                                    |   3%
  ...
  |======================================================================| 100%

There is also a very useful .errorhandling option in foreach(), which allows you to specify what to do if an error is generated (e.g. pass over it) - bear this in mind for later.
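As a sketch of how .errorhandling behaves (using %do% here so it runs without a cluster; "pass" is one of the documented options, alongside "stop" and "remove"):

```r
##load the foreach package
library(foreach)

##with .errorhandling = "pass", the error object itself is returned
##for that iteration instead of the whole loop stopping
res <- foreach(i = 1:3, .errorhandling = "pass") %do% {
  if (i == 2) stop("something went wrong") else sqrt(i)
}

##res[[2]] is now an error object; res[[1]] and res[[3]] are numbers
res
```

When you are completely finished with a parallel backend, it is also good practice to shut the background R processes down with stopCluster(clus).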

3 Key Concepts

Let’s quickly summarise what we have learned today:

  • methods for timing code
  • concise vs readable coding
  • how to use vectorised functions to speed up code
  • the do’s and don’ts of defining objects
  • why you should preallocate memory
  • how to run parallel operations in R

4 Functions covered today

Function/operator                                            Call
Time an operation                                            system.time()
Compare the speed of one or more functions over replicates   microbenchmark()
Find the number of cores on your computer                    detectCores()
Parallelised version of lapply()                             mclapply()
Dark theme for ggplot                                        theme_dark()
Initialise a loop using the foreach package                  foreach()
Make a progress bar                                          txtProgressBar()
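As a quick refresher on the timing functions above, here is a minimal sketch (assuming the microbenchmark package is installed) comparing a vectorised sum against an explicit for() loop:

```r
##load microbenchmark for repeated timings
library(microbenchmark)

##some data to sum
x <- runif(1e4)

##compare a vectorised sum with an explicit for() loop;
##both give the same answer, but the vectorised version is much faster
microbenchmark(
  vectorised = sum(x),
  looped = {
    s <- 0
    for (v in x) s <- s + v
    s
  },
  times = 100
)
```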

6 Extra reading

The excellent R book “Efficient R Programming” is available for free here

7 Homework

7.1 Homework solution

Related to the issues of fitting GLMs and GLMMs, your homework was to try to identify the error distributions, link functions, and significant predictor variables for 4 data sets in the Bioinformatics repository.

So, to give you the answers:

Data         Error distribution   Link function   Significant predictor variables
data 1.csv   Gaussian             inverse         x1, x2
data 2.csv   Gamma                log             x1, x2, x3
data 3.csv   Binomial             logit           x1
data 4.csv   Poisson              log             x1, x2

You might have written the right code but not got the answers above - such are the vagaries of analysing data! Even when we know the distributions (and the data are simulated, so not really noisy) it can be hard to get the “right” answer. Below is the code I used to generate these data, so have a play generating different numbers of predictor variables and amounts of data to see how much data you need to recover the correct pattern and distribution.

##load GlmSimulatoR
library(GlmSimulatoR)

##generate data with an inverse gaussian distribution
##N is the number of data points,
##the weights are the strength of each predictor variable's effect on the y variable
##see ?simulate_inverse_gaussian for more detail
sim_gaussian_inv <- simulate_inverse_gaussian(N = 150, 
                                              unrelated = 2, 
                                              weights = c(5, 5, 0, 0))
##set the names of the columns
names(sim_gaussian_inv) <- c("y", paste0("x", seq_len(ncol(sim_gaussian_inv) - 1)))

mod1 <- glm(y ~ x1 + x2 + x3 + x4, data = sim_gaussian_inv,
            family = gaussian(link = "inverse"))
summary(mod1)
## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = gaussian(link = "inverse"), 
##     data = sim_gaussian_inv)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.110341  -0.050723  -0.008114   0.038522   0.221212  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.1012     1.0444   3.927 0.000133 ***
## x1            0.5794     0.3611   1.605 0.110727    
## x2            0.3289     0.3361   0.979 0.329387    
## x3           -0.2142     0.3548  -0.604 0.546894    
## x4           -0.4874     0.3380  -1.442 0.151450    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.003980511)
## 
##     Null deviance: 0.59829  on 149  degrees of freedom
## Residual deviance: 0.57717  on 145  degrees of freedom
## AIC: -396.36
## 
## Number of Fisher Scoring iterations: 6

##other possible error distributions to simulate:
  ##simulate_gamma()
  ##simulate_binomial()
  ##simulate_poisson()
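For example, to generate and fit data like data 2.csv (Gamma, log link), you could do something like the sketch below. Note I am assuming simulate_gamma() takes the same N/unrelated/weights arguments as simulate_inverse_gaussian() and accepts a link argument - check ?simulate_gamma for the exact defaults before relying on this:

```r
##simulate Gamma-distributed data with a log link
##(arguments assumed to mirror simulate_inverse_gaussian - see ?simulate_gamma)
sim_gamma <- simulate_gamma(N = 150,
                            link = "log",
                            unrelated = 2,
                            weights = c(5, 5, 0, 0))

##set the names of the columns as before
names(sim_gamma) <- c("y", paste0("x", seq_len(ncol(sim_gamma) - 1)))

##fit the matching GLM
mod_gamma <- glm(y ~ x1 + x2 + x3 + x4, data = sim_gamma,
                 family = Gamma(link = "log"))
summary(mod_gamma)
```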

Brilliant, hopefully you all found that helpful and managed to use the tools we learnt last week to detect the correct error distributions. This week, we will be talking about…

7.2 This week’s homework

This week’s homework is to go back through the 6 workbooks you have done so far, make sure everything makes sense to you, and brush up on any bits you feel weak on. This revision session is aimed at prepping you for the upcoming coursework…